OrangeExpDB: an integrative gene expression database for Citrus spp.

Background Citrus is a major fruit crop, and RNA-sequencing (RNA-seq) data can be used to investigate its gene functions, heredity, evolution, and development, and to detect genes linked to essential traits or pathogen resistance. However, public RNA-seq datasets are difficult to use for researchers without bioinformatics training and expertise. Results OrangeExpDB is a web-based database that integrates transcriptome data of various Citrus spp., including C. limon (L.) Burm., C. maxima (Burm.) Merr., C. reticulata Blanco, C. sinensis (L.) Osbeck, and Poncirus trifoliata (L.) Raf., downloaded from the NCBI SRA database. It provides browsing and searching functions and a BLAST tool, enabling quick download of expression matrices for different transcriptome samples. Expression profiles of genes of interest can be retrieved by searching gene IDs or by sequence similarity. Expression data can be downloaded in text format and displayed as a heatmap, with additional sample information provided at the bottom of the webpage. Conclusions Researchers can use OrangeExpDB to facilitate functional genomic analysis and identify key candidate genes, leveraging publicly available citrus RNA-seq datasets. OrangeExpDB can be accessed at http://www.orangeexpdb.com/. Supplementary Information The online version contains supplementary material available at 10.1186/s12864-024-10445-5.


Background
The Aurantioideae subfamily of the Rutaceae family encompasses a variety of species, including Citrus sinensis and its related genera [1]. Oranges and their products are greatly appreciated for their nutritional, economic, and cultural value. Orange juice is one of the most popular drinks [2], and tangerine peels are widely known to have medicinal properties [3]. Fossil records from the late Miocene epoch in Lincang city, Yunnan province of China, suggest that a progenitor of Citrus spp. evolved approximately 8 million years ago [4]. Citrus plants, including oranges, lemons, and mandarins, are grown in more than 140 countries [5,6]. Citrus growers are highly concerned about outbreaks of citrus diseases, including Huanglongbing (HLB), which is the most devastating citrus disease [7]. P. trifoliata is often used as a rootstock for C. sinensis and has displayed a certain level of tolerance to HLB [8]. Unfortunately, Candidatus Liberibacter asiaticus, Ca. L. africanus, and Ca. L. americanus, which are the causal agents of HLB, have yet to be cultured, and there are no HLB-resistant citrus varieties [9]. The inability to culture the HLB pathogens makes it difficult to investigate the pathogenesis [10]. In addition, various other issues need addressing, including yield, flavor, ripening time, stress resistance or tolerance, and mutation [11]. With the rapid progress of Next Generation Sequencing (NGS) technology, a vast quantity of Citrus transcriptome data has been collected, offering researchers the opportunity to address many challenging questions.
In the past decade, the advancement of NGS has generated a large amount of sequence data [12]. By 2023, more than 2,400 samples of Citrus spp. had been published in the NCBI Sequence Read Archive (SRA) database [13,14]. In recent years, RNA-seq technology has been increasingly used to investigate various aspects of Citrus spp., such as gene expression changes caused by HLB [15][16][17][18][19][20]. Researchers have used this technique to build co-expression networks for analyzing core transcription factors in citrus development and stress responses [21,22]. However, raw RNA-seq data is scattered across multiple databases, including SRA, the European Nucleotide Archive (ENA), and the Genome Sequence Archive (GSA), making access difficult [23]. Furthermore, the data is often fragmented. As data growth continues to accelerate, there is a need for a standardized and simplified method of accessing gene expression data [24].
Recently, numerous online databases have been established, including CPBD [25], NGDC [26], TeaPGDB [27], BarleyExpDB [24], PlantcircBase [28], GRooT [29], PHIbase [30], and MPDB [31]. Notably, these databases not only store data but also offer various tools for analyzing biological data, such as BLAST [24,25]. TeaPGDB, for example, is a user-friendly platform for the tea plant genome, providing access to seven tea genome sequences and five tool sets, including "Gene Search", "BLAST", "JBrowse", "SSR" and "Download" [27]. PlantcircBase is a database that consolidates all plant-circRNA data [28]. Importantly, BarleyExpDB contains transcriptional profiles of barley across various growth and developmental stages, tissues, and stress conditions [24]. These databases are useful for analyzing the intricate regulatory mechanisms of various organisms. Despite the availability of multiple genomes of C. sinensis, a comprehensive and centralized database of RNA-seq datasets for Citrus spp. is still lacking.
In this study, we have created OrangeExpDB, a database containing transcriptome data from 1,638 samples of five citrus species: C. limon, C. maxima, C. reticulata, C. sinensis, and P. trifoliata (Table 1). The expression profiles of genes in various tissues, developmental stages, and stress conditions can be easily downloaded and used. OrangeExpDB gives researchers access to citrus RNA-seq data, thus facilitating the subsequent study of critical questions related to citrus.

Species selected
Our database includes five species of Citrus: C. sinensis, C. limon, C. maxima, C. reticulata, and P. trifoliata. C. maxima, C. sinensis, C. limon, and C. reticulata are severely affected by HLB [32][33][34][35], whereas P. trifoliata is often used as a rootstock and shows tolerance to HLB [8]. Therefore, we chose these five representative citrus species to construct the database.

BioProjects and BioSamples
OrangeExpDB was created by collecting RNA-seq data from five species, resulting in a total of 134 studies (Table 1). C. sinensis had the highest number of studies at 61, followed by C. maxima with 28, C. reticulata with 19, P. trifoliata with 15, and C. limon with 11. The datasets for each species were categorized into several groups based on stages/tissues, mutants, and stress treatments.

Collection and option
To streamline the screening process, we compiled information from NCBI and relevant literature, including project names, sample names, and library names. We also renamed samples whose names were unclear or ambiguous, after checking the relevant literature. Descriptions of each project were obtained.
The raw RNA-seq data was downloaded from the NCBI SRA database and converted to fastq format using SRA toolkit v2.10.9 [36], resulting in a total of 1,638 samples (Table 1 and Supplementary Table 1). Adaptor and low-quality sequences in the fastq files were removed using trimmomatic v0.39 [37]. HISAT2 was used to build the index of the genome assembly and to align the RNA-seq reads. The resulting file was in SAM format, which was processed using samtools v1.13 with parameters 'bS' and 'sort' [38]. Finally, stringtie v2.2.1 and a self-written Python script (https://github.com/Viper-Chang/Batchanalysis-of-transcriptome-data) were used to extract the FPKM expression matrix [39]. For the convenience of users, we have also uploaded the TPM values of all genes for each species to the "DOWNLOAD" page of our database. The heatmap on the webpage was created using plotly [40]. The matrix was stored, maintained and operated using MySQL v5.6.50 (Fig. 1).
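The per-sample processing steps above can be sketched as a list of command lines. The following Python sketch is illustrative only and is not the authors' script: the file names, the single-end read layout, the `trimmomatic` wrapper command, the adaptor file `adapters.fa`, and all trimming thresholds are assumptions, since the text names only the tools and the samtools 'bS' and 'sort' steps.

```python
from pathlib import Path


def build_commands(run_id, index_prefix, annotation_gtf, workdir="work"):
    """Return the commands (as argument lists) for one single-end SRA run.

    Nothing is executed here; each list could be passed to
    subprocess.run(cmd, check=True) where the tools are installed.
    """
    wd = Path(workdir)
    raw_fq = wd / f"{run_id}.fastq"
    clean_fq = wd / f"{run_id}.clean.fastq"
    sam = wd / f"{run_id}.sam"
    bam = wd / f"{run_id}.bam"
    sorted_bam = wd / f"{run_id}.sorted.bam"
    out_gtf = wd / f"{run_id}.gtf"
    gene_tab = wd / f"{run_id}.gene_abund.tab"

    return [
        # 1. Convert the SRA run to fastq (SRA toolkit).
        ["fasterq-dump", run_id, "--outdir", str(wd)],
        # 2. Trim adaptors and low-quality bases (thresholds are assumed).
        ["trimmomatic", "SE", str(raw_fq), str(clean_fq),
         "ILLUMINACLIP:adapters.fa:2:30:10", "SLIDINGWINDOW:4:15", "MINLEN:36"],
        # 3. Align reads to an index built beforehand with hisat2-build.
        ["hisat2", "-x", index_prefix, "-U", str(clean_fq), "-S", str(sam)],
        # 4. Convert SAM to BAM, then sort the alignments.
        ["samtools", "view", "-bS", str(sam), "-o", str(bam)],
        ["samtools", "sort", str(bam), "-o", str(sorted_bam)],
        # 5. Quantify annotated genes; -A writes a per-gene abundance table.
        ["stringtie", str(sorted_bam), "-G", annotation_gtf, "-e",
         "-A", str(gene_tab), "-o", str(out_gtf)],
    ]
```

Returning argument lists rather than running the tools keeps the sketch easy to inspect; the run accession passed in (for example a placeholder such as "SRR0000001") is hypothetical.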

Database commons and interface
Our web server is hosted on Tencent Cloud's lightweight application server, which is equipped with four Intel(R) Xeon(R) Platinum 8255C CPUs clocked at 2.50 GHz and 8 GB of RAM. Access to the website is free of charge, as our purpose is not commercial. The Operating System (OS) running on the server is CentOS v7.9 (http://www.centos.org), a Linux-based OS. The web interface was designed using HTML (https://www.w3.org/html/), JavaScript (https://www.javascript.com/) and CSS (http://www.w3.org). The server-side back-end was written in PHP, with PHP scripts that query the MySQL database and return the results to the front-end (Fig. 1).
Each study is provided with a tag containing a summary of the study (Fig. 2A). A search box allows users to query the expression of genes of interest, accommodating up to 500 genes at a time. For more than 500 genes, users can submit multiple queries or download the raw data to extract the desired information (Fig. 2B).
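For gene lists longer than the 500-gene limit, the queries can be prepared in batches. The following is a minimal Python sketch under stated assumptions: the function name `batch_gene_ids` and the gene identifiers are hypothetical, and the sketch only splits the list; it does not contact the website.

```python
def batch_gene_ids(gene_ids, limit=500):
    """Split gene IDs into de-duplicated batches of at most `limit` IDs each."""
    if limit < 1:
        raise ValueError("limit must be a positive integer")
    # dict.fromkeys drops duplicates while keeping first-seen order;
    # blank entries are skipped.
    unique_ids = list(dict.fromkeys(g.strip() for g in gene_ids if g.strip()))
    return [unique_ids[i:i + limit] for i in range(0, len(unique_ids), limit)]


# Example with 1,200 placeholder identifiers (not real gene IDs).
batches = batch_gene_ids([f"gene{i}" for i in range(1200)])
print([len(b) for b in batches])  # [500, 500, 200]
```

Each batch can then be pasted into the search box as a separate query.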

Introduction
OrangeExpDB provides a brief introduction and a dropdown menu for browsing the "Materials and Methods" used to construct the database. Users can view the analysis tools, commands, and parameters. Additionally, the interface provides a comprehensive description of each RNA-seq study, including sample accession number, stages/tissues, treatments, and other data (Fig. 2C).

Blast tool
OrangeExpDB offers an online BLAST service for identifying genes for which only sequence fragments, and no gene IDs, are available. Users can submit sequences in FASTA format, including amino acid and nucleotide sequences, or upload them as a text file. Five BLAST algorithms (e.g., BLASTN, BLASTP, and TBLASTX) are available to identify possible homologous sequences. Results are displayed in order, with the top candidates presented side by side for easy comparison (Fig. 2D).

Fig. 1 The procedure for creating the OrangeExpDB database. Raw sequencing data of RNA-seq studies were obtained and filtered, then aligned with the reference genome, resulting in a gene expression matrix. Gene functional information was subsequently annotated and functional modules such as "Blast", "Search" and "Download" were added to the OrangeExpDB website

Downloads
On the "Download" page, users can download the matrix of FPKM and TPM values for BioProjects or BioSamples of interest and re-analyze it (Fig. 2E).
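The two units offered for download are related by a simple rescaling within each sample: TPM_i = FPKM_i / (sum of FPKM over all genes) x 10^6. The sketch below illustrates this standard relationship in Python; it is not a statement of how the TPM values in the database were produced, and the gene names and values are made up for illustration.

```python
def fpkm_to_tpm(fpkm_by_gene):
    """Rescale one sample's FPKM values so that they sum to one million."""
    total = sum(fpkm_by_gene.values())
    if total <= 0:
        raise ValueError("the sample has no expressed genes")
    return {gene: value / total * 1e6 for gene, value in fpkm_by_gene.items()}


# Placeholder values for a single sample (hypothetical genes).
sample = {"geneA": 10.0, "geneB": 30.0, "geneC": 60.0}
tpm = fpkm_to_tpm(sample)
```

Because the rescaling is done per sample, the ranking of genes within a sample is unchanged, while TPM totals become comparable across samples.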

About
The authors who contributed to the design and construction of the database are featured on the "About" page (Fig. 2F). Additionally, generic external links are accessible for further information.

Links
The databases utilized in this study can be found on the "Links" page.

An example for users
To facilitate the use of the database, a straightforward example has been created to extract gene expression matrices of interest from selected BioSamples (Fig. 3). Users first select one of the five citrus species, listed by Latin name on the home page (Fig. 3A). Then, they enter the identifiers of the desired gene loci, choose the category of the BioProjects, select the relevant BioSamples and submit (Fig. 3B). The results page displays detailed gene locus and BioProject information, along with a heatmap and a download link to the expression values of the specified genes in the selected BioSamples (Fig. 3C). Detailed information on the BioSamples, including relevant publications, experiment accession, genotype/phenotype, stage/tissue and sequencing platform, is presented at the bottom of the page (Fig. 3D).
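The same selection can be reproduced offline on a downloaded expression matrix. The sketch below is a minimal Python illustration under stated assumptions: the layout of the text file (tab-delimited, a `gene_id` column followed by one column per sample) is assumed rather than documented here, and the gene names and run accessions are placeholders.

```python
import csv
import io


def subset_matrix(matrix_text, gene_ids, sample_ids):
    """Return {gene: {sample: value}} for the requested genes and samples."""
    reader = csv.DictReader(io.StringIO(matrix_text), delimiter="\t")
    wanted = set(gene_ids)
    result = {}
    for row in reader:
        gene = row["gene_id"]  # assumed name of the first column
        if gene in wanted:
            result[gene] = {s: float(row[s]) for s in sample_ids}
    return result


# A tiny made-up matrix standing in for a downloaded file.
demo = (
    "gene_id\tSRR0000001\tSRR0000002\tSRR0000003\n"
    "geneA\t1.5\t0.0\t12.25\n"
    "geneB\t3.0\t4.5\t0.5\n"
)
print(subset_matrix(demo, ["geneB"], ["SRR0000001", "SRR0000003"]))
# {'geneB': {'SRR0000001': 3.0, 'SRR0000003': 0.5}}
```

The nested dictionary can then be passed to a plotting library to draw a heatmap comparable to the one shown on the results page.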

Prospects
OrangeExpDB is a dynamic database that offers convenient access to gene expression data for various citrus species. It will be regularly updated with the latest genomic information to ensure the accuracy of the expression matrix for each species. OrangeExpDB is designed to accommodate growing data and can be easily expanded. In future updates, the database will also include features for identifying RNA-editing sites and integrating single-cell RNA sequencing (scRNA-seq) data. Python scripts will be provided to simplify usage, and contributions from external groups and individuals are encouraged.

Conclusions
OrangeExpDB is a comprehensive web-accessible database of RNA-seq data for citrus plants. It enables users to quickly search for information using known gene IDs and provides expression levels across various tissues, developmental stages, and stresses. Additionally, the database provides useful tools such as function annotation, visualization, and result downloading. OrangeExpDB is a valuable resource for researchers looking to access and use citrus transcriptome data.

Fig. 2 An overview of OrangeExpDB database.A and B Homepage of OrangeExpDB.C Introduction of OrangeExpDB.D Blast tools of OrangeExpDB.E Download page of OrangeExpDB.F Contact information and relevant hyperlinks

Fig. 3 A demonstration of the usage of the OrangeExpDB database. A Selection of the species of interest. B Options for generating the expression values of candidate genes. C An example of the result page. D The sample information on the result page

Table 1 Statistics of BioProjects and BioSamples of Citrus spp. in this study